Word embedding is a collective term for unsupervised ML models that learn to map a set of words $w$ (or phrases, stems, lemmas) in a vocabulary to vectors of numerical values. This approach reduces the number of dimensions from the number of unique words $V$ (i.e. the vocabulary size) to a much lower value $N$, where the dimensions are shared by all words, so that the vectors $w^{\prime}$ are no longer orthogonal. In addition, because of the way the embeddings are computed, the ML models discover patterns in the relations between words (such as a shared context).
Given a word vocabulary $W=\{w_i\}$ with $|W|=V$, each word is initially represented by a unit vector (one-hot encoding in ML), such as
$w_1=\begin{bmatrix}1 \\ 0 \\ \vdots \\ 0\end{bmatrix}$, $w_2=\begin{bmatrix}0 \\ 1 \\ \vdots \\ 0\end{bmatrix}$, $\dots$, $w_V=\begin{bmatrix}0 \\ 0 \\ \vdots \\ 1\end{bmatrix}$, $W\in \mathbb{R}^{V \times V}$
The word embedding then maps each representation $w_i$, $i=1,2,\dots,V$, to another representation $w^{\prime}_i$ (through Skip-gram or CBOW modeling) in a much smaller dimension $N$, with $N \ll V$, so that $W^{\prime}\in \mathbb{R}^{N \times V}$.
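As a toy illustration of this mapping, here is a minimal sketch (the $N \times V$ matrix below is random, standing in for learned weights; all names are illustrative):

```python
import numpy as np

V, N = 10000, 100                        # vocabulary size and embedding size, N << V
rng = np.random.default_rng(0)
W_prime = rng.normal(size=(N, V))        # stand-in for a learned embedding matrix W'

w2 = np.zeros(V)
w2[1] = 1.0                              # one-hot vector for the 2nd word
v2 = W_prime @ w2                        # multiplying W' by a one-hot vector...
assert np.allclose(v2, W_prime[:, 1])    # ...simply selects the 2nd column of W'
print(v2.shape)                          # (100,)
```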
Both Skip-gram and Continuous Bag of Words (CBOW) models use a neural network architecture to model the mapping from the original $W$ to the embedding $W^{\prime}$.
In the embedded vector space $W^{\prime}$, word vectors that are close to one another represent words that are related to each other.
Recall that this one-hot style representation of words was also used in the Tf-Idf feature matrix, where each column is a word and each row is a document or a sentence.
An n-gram is a contiguous sequence of n items from a given sample of text. The items can be letters, syllables, or words, depending on the application. The n-grams are typically collected from a text or speech corpus. When the items are words, n-grams are also called shingles.
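For instance, using nltk's ngrams helper (the example sentence is arbitrary):

```python
from nltk import ngrams

tokens = 'the man who passes the sentence'.split()
print(list(ngrams(tokens, 2)))   # word bigrams
# [('the', 'man'), ('man', 'who'), ('who', 'passes'), ('passes', 'the'), ('the', 'sentence')]
print(list(ngrams('sword', 3)))  # character trigrams
# [('s', 'w', 'o'), ('w', 'o', 'r'), ('o', 'r', 'd')]
```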
Define the context as a symmetric window centered on the target word $w_t$, containing the surrounding tokens at a distance of at most the window size $\texttt{ws}$: $C_t = \{w_k \mid k \in [t-\texttt{ws},\, t+\texttt{ws}],\ k \neq t\}$
The Skip-gram model predicts the context words within a specific window given the current word. The input layer of the neural network uses the current word and the output layer uses the context words. The hidden layer contains a number of nodes matching the number of dimensions $N$.
Skip-gram learns by predicting the context for a given target word, maximizing the likelihood $\prod\limits_{t=1}^{T}P(C_t|w_t)$, where each context probability is usually factorized over its members as $P(C_t|w_t)=\prod_{w_c \in C_t}P(w_c|w_t)$.
"The man who passes the sentence should swing the sword." Ned Stark
Sliding window size $\texttt{ws}=5$ (the window spans up to five tokens: the target plus up to two on each side); the resulting target/context pairs are shown below, with a code sketch after the table.
| Window | Target | Context |
|---|---|---|
| [The, man, who] | the | man, who |
| [The, man, who, passes] | man | the, who, passes |
| [The, man, who, passes, the] | who | the, man, passes, the |
| [man, who, passes, the, sentence] | passes | man, who, the, sentence |
| … | … | … |
| [sentence, should, swing, the, sword] | swing | sentence, should, the, sword |
| [should, swing, the, sword] | the | should, swing, sword |
| [swing, the, sword] | sword | swing, the |
The Continuous Bag of Words model predicts the current word given the context words within a specific window. The input layer uses the context words and the output layer uses the current word. The hidden layer is again of length $N$. CBOW is thus the mirror image of Skip-gram.
The CBOW model tries to predict the target word given its context, maximizing the likelihood $\prod\limits_{t=1}^{T}P(w_t|C_t)$
To model the Skip-gram or CBOW probabilities, a Softmax activation is used on top of the inner product between a target vector $\texttt{u}_{w_t}$ and the averaged context vector $\frac{1}{|C_t|}\sum_{w \in C_t}\texttt{v}_w$.
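In NumPy terms, the probability of one target word under CBOW looks roughly as follows (a sketch with random stand-in embedding tables `u` and `v`; nothing here is trained):

```python
import numpy as np

V, N = 10000, 100
rng = np.random.default_rng(1)
u = rng.normal(scale=0.1, size=(V, N))  # target (output) vectors u_w
v = rng.normal(scale=0.1, size=(V, N))  # context (input) vectors v_w

context_ids = [3, 17, 42, 99]           # indices of the words in C_t
h = v[context_ids].mean(axis=0)         # averaged context vector
scores = u @ h                          # inner product with every candidate target
p = np.exp(scores - scores.max())       # numerically stable softmax
p /= p.sum()
print(p.shape, float(p.sum()))          # (10000,) 1.0
```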
Disadvantage: a limitation of word embeddings is that the possible multiple meanings of a word are conflated into a single representation (unlike WordNet, where this knowledge is carried in the graph).
Solution: develop a data-driven WordNet, such as one based on word embeddings.
"You shall know a word by the company it keeps." J.R. Firth
Word2vec techniques use the context of a given word to learn its semantics: Word2vec learns numerical representations of words by looking at the words surrounding a given word.
Imagine encountering the following sentence in an exam: "Mary is a very stubborn child. Her pervicacious nature always gets her in trouble." What does pervicacious mean? The words surrounding the word of interest are important. In our example, pervicacious is surrounded by stubborn, nature, and trouble. These three words are enough to determine that pervicacious in fact means a state of being stubborn.
Gensim (= "Generate Similar") is a topic modeling library that implements latent semantic methods; it is licensed under the GNU LGPLv2.1 license.
The Word2Vec module of gensim can generate CBOW and skip-gram models. Here is the API.
Note that Word2Vec does not remove stop words, because the algorithm relies on the broader context of the sentence in order to produce high-quality word vectors.
Word2vec processes text by vectorizing words, generating feature vectors that represent the words in the corpus. Similarly, Doc2Vec processes entire variable-length documents, vectorizing them into fixed-length feature vectors. Doc2Vec uses a scheme similar to Word2Vec, extending skip-gram or CBOW with an additional document/paragraph vector $D$. During the training of the words, as in Word2Vec, the document vector $D$ is also trained, and thus it comes to represent the document. Here is the API.
A conda environment is a directory that contains a specific collection of installed conda packages. For example, one environment with NumPy 1.7 and its dependencies and another environment with NumPy 1.6 for legacy testing can coexist on the same computing platform. When one environment is updated, the others are not affected, and we can easily activate or deactivate environments to switch between them.
Before running the Jupyter notebook we have to activate the environment. In the following example we used gensim as the environment name, activated it, and then installed the gensim library (and the other necessary libraries) in that environment.
conda create -n gensim
conda activate gensim
conda install gensim
conda install -c conda-forge python-levenshtein
conda install jupyter
conda install nltk

and thereafter every necessary library.
Deactivating the environment: conda deactivate
Deleting the environment: conda remove -n gensim --all
Let's load the nltk dataset named abc and use gensim to generate word embeddings with the CBOW approach.
%%time
import nltk
import gensim
print(f'gensim version= {gensim.__version__}')
from gensim.models import Word2Vec
from nltk.corpus import abc
sents = list(abc.sents())
model = Word2Vec(sents, min_count=2, workers=4)  # CBOW (sg=0) is the gensim default
X = list(model.wv.index_to_key)
# Sanity
print(f'ABC dataset has {len(sents)} sentences')
print(f'gensim model vocabulary has {len(X)} words mapped to N= {model.vector_size} dimensions')
gensim version= 4.3.0
ABC dataset has 29059 sentences
gensim model vocabulary has 19484 words mapped to N= 100 dimensions
CPU times: total: 11 s
Wall time: 8.04 s
# The closest words to the word 'science'
science = model.wv.most_similar('science')
print(science)
[('agriculture', 0.962886393070221), ('Coalition', 0.9452241063117981), ('law', 0.943227231502533), ('management', 0.9409330487251282), ('textile', 0.9401405453681946), ('biosecurity', 0.9397327899932861), ('general', 0.9383535981178284), ('descend', 0.9369604587554932), ('bulk', 0.936199963092804), ('education', 0.9359628558158875)]
# Cosine similarity between 'science' and 'computer'
science12 = model.wv.similarity('science', 'computer')
print(science12)
0.7867609
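Under the hood, most_similar and similarity are based on cosine similarity; we can verify this directly against the model trained above:

```python
import numpy as np

a, b = model.wv['science'], model.wv['computer']
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # should match model.wv.similarity('science', 'computer')
```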
Let's look at another example, from Shakespeare's play Hamlet, using the CBOW and skip-gram methods, respectively.
from nltk.corpus import gutenberg
sents = list(gutenberg.sents('shakespeare-hamlet.txt'))
print(sents[0])
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
%%time
# CBOW model
model1 = Word2Vec(sents, vector_size=200, sg=0, window=13, min_count=1, epochs=20, workers=4)
# Skip-gram model
model2 = Word2Vec(sents, vector_size=200, sg=1, window=13, min_count=1, epochs=20, workers=4)
CPU times: total: 5.98 s
Wall time: 1.92 s
Let's find out what the model tells us when the context is the names: ['Hamlet', 'Ophelia', 'Ghost']
similarities1b = model1.wv.most_similar(positive=['Hamlet'], topn=20)
similarities1 = model1.wv.most_similar(positive=['Hamlet', 'Ophelia', 'Ghost'], topn=20)
similarities2 = model2.wv.most_similar(positive=['Hamlet', 'Ophelia', 'Ghost'], topn=20)
# Remove stop words and tokens that are not purely alphabetic words of length >= 3
def filter_words(_sim):
    from nltk.corpus import stopwords
    import re
    stop_words = set(stopwords.words('english'))
    return [(w, p) for w, p in _sim if w.lower() not in stop_words and re.search(r'^[a-zA-Z]{3,}$', w) is not None]
similarities1b = filter_words(similarities1b)
similarities1 = filter_words(similarities1)
similarities2 = filter_words(similarities2)
for (w1, s1), (w1b, s1b), (w2, s2) in zip(similarities1, similarities1b, similarities2):
    print(f'{w1:16s}{s1:.3f}\t\t{w1b:16s}{s1b:.3f}\t\t{w2:16s}{s2:.3f}')
Horatio         0.993    Ghost           0.992    Rosincrane      0.908
Manet           0.991    Manet           0.986    Manet           0.852
Reynoldo        0.988    Horatio         0.986    Claudius        0.845
goodnight       0.988    Ophelia         0.985    Voltumand       0.842
shout           0.988    goodnight       0.979    Sister          0.840
Thankes         0.987    shout           0.978    Attendant       0.835
twaine          0.987    Noise           0.978    Queene          0.828
Voltemand       0.986    Rosincrane      0.978    Marcellus       0.818
Rosincrane      0.986    Saylor          0.977    Guildenstern    0.818
afarre          0.985    Reynoldo        0.976    bloody          0.818
vnknowne        0.985    Voltemand       0.976    Polonius        0.811
standing        0.984    twaine          0.976    Laertes         0.810
Noise           0.984    Thankes         0.975    Welcome         0.808
Lights          0.984    sickly          0.975    Osricke         0.807
Osricke         0.984    afarre          0.974    Drumme          0.807
Goodnight       0.984    Laertes         0.974    Gertrude        0.806
Ruine           0.984    standing        0.974    Coffin          0.802
Barnardo        0.983    vnknowne        0.973    Farewell        0.800
# The word embedding matrix
words1a = [w for w,s in similarities1] + ['Hamlet']
X1a = model1.wv[words1a]
words2a = [w for w,s in similarities2] + ['Hamlet']
X2a = model2.wv[words2a]
# Sanity
print(X1a.shape)
(20, 200)
We can use classical projection methods, such as PCA, to reduce the high-dimensional word vectors to two dimensions for plotting. The visualizations can provide a qualitative diagnostic for our learned model.
Let's train a projection method on the vectors.
from sklearn.decomposition import PCA
pca_model1a = PCA(n_components=2).fit_transform(X1a)
pca_model2a = PCA(n_components=2).fit_transform(X2a)
%matplotlib inline
import matplotlib.pyplot as plt
def plot_pca(_pca_model, _words, _title):
    plt.scatter(_pca_model[:, 0], _pca_model[:, 1])
    for i, word in enumerate(_words):
        # Highlight 'Hamlet' in red, all other words in black
        plt.annotate(word, xy=(_pca_model[i, 0], _pca_model[i, 1]), c=('r' if word == 'Hamlet' else 'k'))
    plt.title(_title)
plt.figure(figsize=(12, 6), dpi=72)
ax=plt.subplot(1, 2, 1)
plot_pca(pca_model1a, words1a, 'CBOW Model')
ax=plt.subplot(1, 2, 2)
plot_pca(pca_model2a, words2a, 'Skip-gram Model')
plt.show()
dissimilarities1 = model1.wv.most_similar(negative=['Hamlet'], topn=20)
dissimilarities2 = model2.wv.most_similar(negative=['Hamlet'], topn=20)
words1b = ['Hamlet'] + [w for w,s in similarities1] + [w for w,s in dissimilarities1]
X1b = model1.wv[words1b]
words2b = ['Hamlet'] + [w for w,s in similarities2] + [w for w,s in dissimilarities2]
X2b = model2.wv[words2b]
pca_model1b = PCA(n_components=2).fit_transform(X1b)
pca_model2b = PCA(n_components=2).fit_transform(X2b)
plt.figure(figsize=(20, 10), dpi=72)
ax=plt.subplot(1, 2, 1)
plot_pca(pca_model1b, words1b, 'CBOW Model')
ax=plt.subplot(1, 2, 2)
plot_pca(pca_model2b, words2b, 'Skip-gram Model')
plt.show()
print(model1.wv.most_similar(positive=['Alas', 'poor', 'Yorick', 'Horatio'], topn=20))
[('sweet', 0.995482325553894), ('Gertrude', 0.9953031539916992), ('Sister', 0.9948393106460571), ('Roughly', 0.9948161840438843), ('false', 0.9947570562362671), ('O', 0.99473637342453), ('Oh', 0.9947359561920166), ('yong', 0.994587242603302), ('Are', 0.9944217205047607), ('ioyes', 0.9943819642066956), ('cheerefully', 0.9943240284919739), ('!', 0.9942605495452881), ('hoa', 0.9941666722297668), ('awake', 0.9941198825836182), ('sore', 0.9940646290779114), ('teares', 0.9940516352653503), ('home', 0.9938029050827026), ('looke', 0.9936867356300354), ('exception', 0.9936301112174988), ('Scull', 0.9935811161994934)]
print(model1.wv.most_similar(negative=['Alas', 'poor', 'Yorick', 'Horatio'], topn=20))
[('Conference', 0.9336201548576355), ('Hiperion', 0.9247527122497559), ('Siluer', 0.9054037928581238), ('range', 0.8802468776702881), ('Happy', 0.8748757243156433), ('Satyre', 0.8745262622833252), ('threats', 0.8663226366043091), ('scann', 0.8452069759368896), ('Scourge', 0.8443164825439453), ('Rood', 0.8409736752510071), ('months', 0.8404901027679443), ('punish', 0.8375216126441956), ('Sphere', 0.8307065367698669), ('truster', 0.830676257610321), ('Minister', 0.7938308715820312), ('Lunacie', 0.7803285121917725), ('trifling', 0.7726242542266846), ('space', 0.7693579792976379), ('vttered', 0.7597777843475342), ('Greefes', 0.7589038014411926)]
Do you notice any meaningful words given the context ['Alas', 'poor', 'Yorick', 'Horatio'] above?
And what is that 'O'?
In the previous lectures we used Tf-Idf features to classify six news categories in the Reuters corpus. Now let's see how we can use Word2Vec-generated features.
The word embeddings being of size $N$, given the set of embedding vectors $v_i$, $i=1\dots k$, of a document $d$ with $k$ words, its feature vector is taken as the mean $\text{fv} = \frac{1}{k}\sum\limits_{i=1}^{k} v_i$, with $\text{fv} \in \mathbb{R}^N$.
# borrowed from previous lectures
from nltk.corpus import reuters
from collections import Counter
import numpy as np
import pandas as pd
Documents = [reuters.raw(fid) for fid in reuters.fileids()]
# Categories is a list of lists, since each news item may have more than one category
Categories = [reuters.categories(fid) for fid in reuters.fileids()]
CategoriesList = [_ for sublist in Categories for _ in sublist]
CategoriesSet = np.unique(CategoriesList)
print(f'N documents= {len(Documents):d}, K unique categories= {len(CategoriesSet):d}')
counts = Counter(CategoriesList)
counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
# Build the news category list
yCategories = [_[0] for _ in counts[:5]]
yCategories += ['other']
# Sanity check
print(f'K categories for classification= {len(yCategories):d} {yCategories}')
N documents= 10788, K unique categories= 90
K categories for classification= 6 ['earn', 'acq', 'money-fx', 'grain', 'crude', 'other']
# Assign a category for each news text
yCat = []
for cat in Categories:
    bFound = False
    for _ in yCategories:
        if _ in cat:
            yCat += [_]
            bFound = True
            break  # So we add only one category per news item
    if not bFound:
        yCat += ['other']
# Sanity check
print(f'N categories= {len(yCat):d}')
N categories= 10788
# Convert to numerical np.array which sklearn likes
ydocs = np.array([yCategories.index(_) for _ in yCat])
from nltk import word_tokenize
Sentences = [word_tokenize(doc) for doc in Documents]
%%time
# CBOW model
model = Word2Vec(Sentences, vector_size=300, sg=0, window=9, min_count=1, epochs=20, workers=4)
CPU times: total: 1min 5s
Wall time: 17.8 s
# Use the mean of the word vectors that make up a sentence or document as its feature vector
# Note that there are better ways to build a document feature vector - such as doc2vec in gensim
Xdocs = np.array([np.mean([model.wv[word] for word in doc], axis=0) for doc in Sentences])
print(Xdocs.shape)
(10788, 300)
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
def kfold_eval_docs(_clf, _Xdocs, _ydocs):
    # Need an indexable data structure
    acc = []
    kf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
    for train_index, test_index in kf.split(_Xdocs, _ydocs):
        _clf.fit(_Xdocs[train_index], _ydocs[train_index])
        y_pred = _clf.predict(_Xdocs[test_index])
        acc += [accuracy_score(_ydocs[test_index], y_pred)]
    return np.array(acc)
%%time
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
acc = kfold_eval_docs(nb, Xdocs, ydocs)
print(f'Naive Bayes CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Naive Bayes CV accuracy= 0.693 ±0.032
CPU times: total: 328 ms
Wall time: 306 ms
%%time
from sklearn.ensemble import RandomForestClassifier
n_cores = 8
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Random Forest CV accuracy= 0.885 ±0.015
CPU times: total: 9min 4s
Wall time: 1min 10s
%%time
from sklearn.svm import SVC
svm = SVC(kernel='rbf', gamma='scale', class_weight='balanced')
acc = kfold_eval_docs(svm, Xdocs, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Support Vector Machine CV accuracy= 0.889 ±0.013
CPU times: total: 26.1 s
Wall time: 26.1 s
%%time
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
# To avoid non-convergence one has to increase the 'max_iter' parameter
lr = LogisticRegression(solver='sag', multi_class='auto', max_iter=500, class_weight='balanced')
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    acc = kfold_eval_docs(lr, Xdocs, ydocs)
print(f'Logistic Regression CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Logistic Regression CV accuracy= 0.899 ±0.009
CPU times: total: 2min 20s
Wall time: 2min 8s
Notice: in this previously seen problem, the classification performance is almost as good as in the previous lectures, and it runs much faster since the feature vectors have only 300 dimensions (the original Tf-Idf dimension M was 29016).
In the previous cells we used Word2Vec features to classify six news categories in the Reuters corpus. Now let's see how we can use Doc2Vec-generated features.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Doc2Vec expects the TaggedDocument input data structure: every document is a list of words tagged with an int ID
DocumentsTagged = [TaggedDocument(word_tokenize(reuters.raw(fid)), [i]) for i, fid in enumerate(reuters.fileids())]
model2 = Doc2Vec(DocumentsTagged, vector_size=100, window=9, min_count=1, epochs=20, workers=4)
# Build X from Doc2Vec document vectors
Xdocs2 = np.array([model2.dv[_.tags[0]] for _ in DocumentsTagged])
print(Xdocs2.shape)
(10788, 100)
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs2, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
acc = kfold_eval_docs(svm, Xdocs2, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Random Forest CV accuracy= 0.730 ±0.015
Support Vector Machine CV accuracy= 0.774 ±0.018
# Try another model: PV-DBOW (dm=0) instead of the default PV-DM (dm=1)
model3 = Doc2Vec(DocumentsTagged, vector_size=100, dm=0, window=9, min_count=1, epochs=20, workers=4)
Xdocs3 = np.array([model3.dv[_.tags[0]] for _ in DocumentsTagged])
print(Xdocs3.shape)
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs3, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
acc = kfold_eval_docs(svm, Xdocs3, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
(10788, 100)
Random Forest CV accuracy= 0.882 ±0.012
Support Vector Machine CV accuracy= 0.911 ±0.012
Notice: in this approach the classification performance is almost as good as the previous results (or better), and it runs much faster since the feature vectors have only 100 dimensions (the original Tf-Idf dimension M was 29016).
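As a side note, a trained Doc2Vec model can also embed documents it has never seen via infer_vector; a minimal sketch (the example sentence is arbitrary):

```python
# Infer a vector for an unseen document in the same 100-dimensional space
new_doc = word_tokenize('Grain shipments rose sharply this quarter.')
vec = model3.infer_vector(new_doc, epochs=50)
print(vec.shape)  # (100,)
```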
Exercise 1. Many different approaches can be tried with the Word2Vec-generated word vectors and document vectors, such as taking the minimum and maximum vector values (magnitude-wise), doubling the dimension of the feature vectors (from $M$ to $2M$), etc.
Exercise 2. Try Top2vec and compare it to Doc2vec.